Discussion on Hash Tables

Discover more aspects regarding hash tables.

We'll cover the following

Additional notes#

Hash tables and hash codes represent an enormous and active field of research that is just touched upon in this chapter. The online Bibliography on Hashing contains nearly 2000 entries.

A variety of different hash table implementations exist. The one we discussed before in this chapter is known as hashing with chaining (each array entry contains a chain (List) of elements). Hashing with chaining dates back to an internal IBM memorandum authored by H. P. Luhn and dated January 1953. This memorandum also seems to be one of the earliest references to linked lists.

svg viewer

An alternative to hashing with chaining is that used by open addressing schemes, where all data is stored directly in an array. These schemes include the LinearHashTable structure. This idea was also proposed, independently, by a group at IBM in the 1950s. Open addressing schemes must deal with the problem of collision resolution: the case where two values hash to the same array location. Different strategies exist for collision resolution; these provide different performance guarantees and often require more sophisticated hash functions than the ones described here.

Yet another category of hash table implementations are the so-called perfect hashing methods. These are methods in which find(x) operations take O(1)O(1) time in the worst-case. For static data sets, this can be accomplished by finding perfect hash functions for the data; these are functions that map each piece of data to a unique array location. For data that changes over time, perfect hashing methods include FKS two-level hash tables and cuckoo hashing.

The hash functions presented in this chapter are probably among the most practical methods currently known that can be proven to work well for any set of data. Other provably good methods date back to the pioneering work of Carter and Wegman, who introduced the notion of universal hashing and described several hash functions for different scenarios. Tabulation hashing, described before, is due to Carter and Wegman but its analysis, when applied to linear probing (and several other hash table schemes), is due to Paˇ\check{a}tras¸\c{s}cu and Thorup.

The idea of multiplicative hashing is very old and seems to be part of the hashing folklore. However, the idea of choosing the multiplier zz to be a random odd number, and the analysis we done in multiplicative hashing is due to Dietzfelbinger. This version of multiplicative hashing is one of the simplest, but its collision probability of 2/2d2/2^{d} is a factor of two larger than what one could expect with a random function from 2w2d2^{w} \rightarrow 2^{ d} . The multiply-add hashing method uses the function

h(x)=((zx+b)mod22w)div 22wdh(x) = \left(\left(zx^{} + b^{}\right) \mod 2^{2w}\right)div \ 2^{2w-d}

where zz and bb are each randomly chosen from {0,...,22w1}\left\{0,...,2^{2w}−1\right\}. Multiply-add hashing has a collision probability of only 1/2d1/2^{d}, but requires 2w2w-bit precision arithmetic.

There are a number of methods of obtaining hash codes from fixed-length sequences of ww-bit integers. One particularly fast method is the function

h(x0,,xr1)=(i=0r/21((x2i+a2i)mod2w)((x2i+1+a2i+1)mod2w))mod22wh(x_0,\dots,x_{r-1})= \left(\sum_{i=0}^{{r/2}-1} \left(\left(x_{2i} + a_{2i}\right) \mod 2^{w}\right)\left(\left(x_{2i+1} + a_{2i+1}\right) \mod 2^{w}\right)\right) \mod 2^{2w}

where rr is even and a0,...,ar1a_{0},...,a_{r−1} are randomly chosen from {0,...,2w}\{0,...,2^{w}\}. This yields a 2w2w-bit hash code that has collision probability 1/2w1/2^{w}. This can be reduced to a ww-bit hash code using multiplicative (or multiply-add) hashing. This method is fast because it requires only r/22wr/2 2w-bit multiplications, whereas the method we described in hash codes for compound objects requires rr multiplications. (The mod operations occur implicitly by using ww and 2w2w-bit arithmetic for the additions and multiplications, respectively.)

svg viewer

The method of using polynomials over prime fields to hash variable-length arrays and strings is due to Dietzfelbinger. Due to its use of the mod operator which relies on a costly machine instruction, it is, unfortunately, not very fast. Some variants of this method choose the prime pp to be one of the form 2w12^{w} − 1, in which case the mod operator can be replaced with addition (+)(+) and bitwise-and (&) operations. Another option is to apply one of the fast methods for fixed-length strings to blocks of length cc for some constant c>1c > 1 and then apply the prime field method to the resulting sequence of r/c\lceil r/c \rceil hash codes.

Quiz on Hash Tables

Exercise: Hash Tables